Case Study 2 Bellabeat

Description of image

Ask

The purpose of this analysis is to gain a deeper understanding of the Leaf bracelet , a product from the wearable health and wellness company Bellabeat. By analyzing the usage data of the FITBIT smart device, I aim to learn more about how people are currently using their devices. Specifically, I want to identify trends and patterns in the data that will help us improve our understanding of consumer usage. To achieve this goal, I will need to analyze data on customer demographics, product usage, and sales data. We will also need to perform statistical analyses to identify significant trends and patterns in the data. Based on my findings, I will make recommendations for how to optimize our marketing efforts and improve product performance.

Preparing the data

Data Sources

The data source for this analysis is a public dataset called the FitBit Fitness Tracker Data, which was made available through Mobius on Kaggle under the CC0: Public Domain license. The data set contains personal fitness tracker information from 30 Fitbit users, who participated in a distributed survey conducted via Amazon Mechanical Turk between 03.12.2016-05.12.2016. The data includes minute-level output for physical activity, heart rate, and sleep monitoring, and provides insights into the daily habits of smart device users.

Tables used From FitBit Fitness Tracker Data

For my analysis, I used the following tables from the FitBit Fitness Tracker Data:

  • DailyActivity_merged
  • DailyIntensities_merged
  • Sleepday - sleepDay_merged
  • DailySteps_merged
  • Before analyzing the data, I downloaded the three CSV files and imported them into Big Query for cleaning. During the data preparation process, I encountered an error message related to the timestamp format in the Sleepday - sleepDay_merged table. The error indicated that Big Query was unable to parse the timestamp ‘4/12/2016 12:00:00 AM’ as a valid TIMESTAMP value due to the unrecognized time zone ‘AM’. To fix this error, I modified the timestamp format in the CSV file to a format that is recognized by Big Query.

    Process

    To clean the data, I used SQL within Big Query to modify each table individually. The cleaning process is shown below.

    Cleaning the data through SQL

  • Daily_activity table cleaning
  • Checking if there are any nulls in any columns in the daily_activity table

    Checking if there are any nulls in any columns in the daily_activity table

    Validating the match of survey dates 04.12.2016-05.12.2016 and the SedentaryActiveDistance column in the daily_activity table should not be below zero since Sedentary means inactive.

    Counting the highest and lowest calories burned and noticing that there is 0 calories as the lowest. Therefore finding how many particpants have only 0 calories burned.

    3 partipants have 0 calories for the same days of activity.

    Finding how many participants have a LoggedActivitiesDistance that is greater than zero since zero is typically registered

  • Daily_intense table cleaning
  • Checking if there are any nulls and duplicates any of the columns in the daily_intense table

    Checking that there is different activity days per participant and validating the match of survey dates 04.12.2016-05.12.2016

    Checking what is the most and least amount of minututes for the LightlyActiveMinutes and FairlyActiveMinutes column

    Checking and counting the Outliers in the FairlyActiveMinutes column

    Creating another column that is called fair_validity that shows if the SedentaryActiveDistance is below 74 to make that an invalid number and counting the amount of invalid entries there are

    Checking what is the most and least amount of minutes for the VeryActiveMinutes column and checking for outliers and the total count of outliers.

    Creating another column called very_validity that shows if VeryActiveMinutes is below 120 to make that an invalid number

    The SedentaryActiveDistance column in the daily_intense table should not be below zero since Sedentary means inactive. I counted the highest and lowest SedentaryActiveDistance and creating another colum that called SedentaryActiveDistance_validitity that shows if SedentaryActiveDistance is below 0.01 and making that an invalid number.

    Since there are inconsistencies in the decimal places for the columns LightActiveDistance, ModeratelyActiveDistance, and VeryActiveDistance I rounded all numbers by three decimal points.

  • Daily_sleep table cleaning
  • Checking if there are any nulls and duplicates in any columns in the daily_sleep table. Also checking if the the dates match the survey dates of 04.12.2016-05.12.2016.

    Checking what is the most and least amount of sleep records for the TotalSleepRecords and TotalMinutesAsleep column

    Creating another colum that called totalsleep_validitity that shows if TotalMinutesAsleep is below 61 and greater than 775 making that an invalid number

    Checking what is the most and least amount of assuming minutes in bed for the TotalTimeInBed column. Also,checking the number of Outliers and creating another colum that called TotalTimeInBed_validitity that shows if TotalTimeInBed is below 61 and greater than 961 making that an invalid number.

  • Daily_steps table cleaning
  • Checking if there are any nulls and duplicates in any columns in the daily_steps table. Also checking if the the dates match the survey dates of 04.12.2016-05.12.2016.

    Analyze

    Installing and loading packages


    install.packages('tidyverse')
    ## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
    ## (as 'lib' is unspecified)
    library(tidyverse)
    ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
    ## ✔ ggplot2 3.4.0      ✔ purrr   0.3.4 
    ## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
    ## ✔ tidyr   1.2.1      ✔ stringr 1.4.0 
    ## ✔ readr   2.1.3      ✔ forcats 0.5.1 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ## ✖ dplyr::filter() masks stats::filter()
    ## ✖ dplyr::lag()    masks stats::lag()
    install.packages('dplyr')
    ## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
    ## (as 'lib' is unspecified)
    library(dplyr)
    install.packages('lubridate')
    ## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
    ## (as 'lib' is unspecified)
    library(lubridate)
    ## Loading required package: timechange
    ## 
    ## Attaching package: 'lubridate'
    ## 
    ## The following objects are masked from 'package:base':
    ## 
    ##     date, intersect, setdiff, union
    install.packages('janitor')
    ## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
    ## (as 'lib' is unspecified)
    library(janitor)
    ## 
    ## Attaching package: 'janitor'
    ## 
    ## The following objects are masked from 'package:stats':
    ## 
    ##     chisq.test, fisher.test
    library(ggplot2)
    library(readr)

    Loading your CSV files

    daily_activity <- read.csv("/cloud/project/bellabeat/dailyActivity_merged.csv")
    daily_intense <- read.csv("/cloud/project/bellabeat/dailyIntensities_merged.csv")
    daily_sleep <- read.csv("/cloud/project/bellabeat/sleepday - sleepDay_merged.csv")
    daily_steps <- read.csv("/cloud/project/bellabeat/dailySteps_merged.csv")
    <

    Checking to see if all tables load

    head(daily_activity)
    ##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
    ## 1 1503960366    4/12/2016      13162          8.50            8.50
    ## 2 1503960366    4/13/2016      10735          6.97            6.97
    ## 3 1503960366    4/14/2016      10460          6.74            6.74
    ## 4 1503960366    4/15/2016       9762          6.28            6.28
    ## 5 1503960366    4/16/2016      12669          8.16            8.16
    ## 6 1503960366    4/17/2016       9705          6.48            6.48
    ##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
    ## 1                        0               1.88                     0.55
    ## 2                        0               1.57                     0.69
    ## 3                        0               2.44                     0.40
    ## 4                        0               2.14                     1.26
    ## 5                        0               2.71                     0.41
    ## 6                        0               3.19                     0.78
    ##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
    ## 1                6.06                       0                25
    ## 2                4.71                       0                21
    ## 3                3.91                       0                30
    ## 4                2.83                       0                29
    ## 5                5.04                       0                36
    ## 6                2.51                       0                38
    ##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
    ## 1                  13                  328              728     1985
    ## 2                  19                  217              776     1797
    ## 3                  11                  181             1218     1776
    ## 4                  34                  209              726     1745
    ## 5                  10                  221              773     1863
    ## 6                  20                  164              539     1728
    head(daily_intense)
    ##           Id ActivityDay SedentaryMinutes LightlyActiveMinutes
    ## 1 1503960366   4/12/2016              728                  328
    ## 2 1503960366   4/13/2016              776                  217
    ## 3 1503960366   4/14/2016             1218                  181
    ## 4 1503960366   4/15/2016              726                  209
    ## 5 1503960366   4/16/2016              773                  221
    ## 6 1503960366   4/17/2016              539                  164
    ##   FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
    ## 1                  13                25                       0
    ## 2                  19                21                       0
    ## 3                  11                30                       0
    ## 4                  34                29                       0
    ## 5                  10                36                       0
    ## 6                  20                38                       0
    ##   LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
    ## 1                6.06                     0.55               1.88
    ## 2                4.71                     0.69               1.57
    ## 3                3.91                     0.40               2.44
    ## 4                2.83                     1.26               2.14
    ## 5                5.04                     0.41               2.71
    ## 6                2.51                     0.78               3.19
    head(daily_sleep)
    ##           Id  SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
    ## 1 1503960366 4/12/2016                 1                327            346
    ## 2 1503960366 4/13/2016                 2                384            407
    ## 3 1503960366 4/15/2016                 1                412            442
    ## 4 1503960366 4/16/2016                 2                340            367
    ## 5 1503960366 4/17/2016                 1                700            712
    ## 6 1503960366 4/19/2016                 1                304            320
    head(daily_steps)
    ##           Id ActivityDay StepTotal
    ## 1 1503960366   4/12/2016     13162
    ## 2 1503960366   4/13/2016     10735
    ## 3 1503960366   4/14/2016     10460
    ## 4 1503960366   4/15/2016      9762
    ## 5 1503960366   4/16/2016     12669
    ## 6 1503960366   4/17/2016      9705

    Joining daily_intense and daily_sleep

    I am joining these two data frames together to analyze how these two are correlated to each other. Through this join I will be answering different questions.

    sleep_and_intensity <- daily_intense %>%
      full_join(daily_sleep, by="Id")

    The maxium amount of very active minutes per particpant

    max_very_act_mins <- aggregate(VeryActiveMinutes ~ Id, sleep_and_intensity, max)
    
    print(max_very_act_mins)
    ##            Id VeryActiveMinutes
    ## 1  1503960366                78
    ## 2  1624580081               186
    ## 3  1644430081                51
    ## 4  1844505072                 2
    ## 5  1927972279                16
    ## 6  2022484408               104
    ## 7  2026352035                 3
    ## 8  2320127002                20
    ## 9  2347167796                66
    ## 10 2873212765                75
    ## 11 3372868164                24
    ## 12 3977333714                50
    ## 13 4020332650                65
    ## 14 4057192912                 3
    ## 15 4319703577                27
    ## 16 4388161847               120
    ## 17 4445114986                34
    ## 18 4558609924                66
    ## 19 4702921684                22
    ## 20 5553957443                69
    ## 21 5577150313               210
    ## 22 6117666160                26
    ## 23 6290855005                33
    ## 24 6775888955                70
    ## 25 6962181067                62
    ## 26 7007744171                64
    ## 27 7086361926                87
    ## 28 8053475328               132
    ## 29 8253242879                49
    ## 30 8378563200               137
    ## 31 8583815059                77
    ## 32 8792009665                10
    ## 33 8877689391               124

    Average amount of sleep per participant according to their maxium VeryActiveMinutes

    averages1 <- sleep_and_intensity %>%
      group_by(Id) %>%
      filter(VeryActiveMinutes == max(VeryActiveMinutes)) %>%
      summarize(avg_sleep_mins = mean(TotalMinutesAsleep))
    print(averages1)
    ## # A tibble: 33 × 2
    ##            Id avg_sleep_mins
    ##         <dbl>          <dbl>
    ##  1 1503960366           360.
    ##  2 1624580081            NA 
    ##  3 1644430081           294 
    ##  4 1844505072           652 
    ##  5 1927972279           417 
    ##  6 2022484408            NA 
    ##  7 2026352035           506.
    ##  8 2320127002            61 
    ##  9 2347167796           447.
    ## 10 2873212765            NA 
    ## # … with 23 more rows

    Graphing the relationship between max VeryActiveMinutes and TotalMinutesAsleep

    ggplot(averages1, aes(x = Id, y = avg_sleep_mins)) +
      geom_point(color = "blue") +
      geom_smooth(color = "red") +
      geom_text(aes(label = Id), size = 3, hjust = 0, vjust = 0) +
      xlab("Id") + ylab("Average Sleep Duration (minutes)") +
      ggtitle("Average Sleep Duration by participant according to their max VeryActiveMinutes")
    ## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
    ## Warning: Removed 9 rows containing non-finite values (`stat_smooth()`).
    ## Warning: Removed 9 rows containing missing values (`geom_point()`).
    ## Warning: Removed 9 rows containing missing values (`geom_text()`).

    According to the graph the less minutes someone is active, the more minutes they tend to sleep.

    Joining daily_activity and daily_steps

    I am joining these two data frames together to analyze how these two are correlated to each other. Through this join I will be answering different questions.

    activity_and_steps <- daily_activity %>%
      full_join(daily_steps, by="Id")

    Showing the correlation between Total steps and Calories burned per participant from the new joined data frame

    steps_calories_by_id <- activity_and_steps %>%
      group_by(Id) %>%
      summarize(total_steps = sum(TotalSteps), calories = sum(Calories))
    print(steps_calories_by_id)
    ## # A tibble: 33 × 3
    ##            Id total_steps calories
    ##         <dbl>       <int>    <int>
    ##  1 1503960366    11644189  1745579
    ##  2 1624580081     5519891  1425504
    ##  3 1644430081     6554670  2530170
    ##  4 1844505072     2479442  1512118
    ##  5 1927972279      880400  2088067
    ##  6 2022484408    10927190  2412079
    ##  7 2026352035     5349763  1480560
    ##  8 2320127002     4532913  1656919
    ##  9 2347167796     3084372   662076
    ## 10 2873212765     7261099  1842206
    ## # … with 23 more rows

    Graphing the correlation between Calories and Totalsteps burned per participant from the new joined.

    ggplot(activity_and_steps, aes(x = Calories )) +
        geom_bar() +
      xlab("Totalcals") + ylab("Total steps") +
      ggtitle("Total Calories Burned by participants")

    Calculating the Average Total Steps per participant from the new joined data frame

    average_totalsteps <- activity_and_steps %>%
      group_by(Id) %>%
      summarize(mean_steps = mean(TotalSteps))
    print(average_totalsteps)
    ## # A tibble: 33 × 2
    ##            Id mean_steps
    ##         <dbl>      <dbl>
    ##  1 1503960366     12117.
    ##  2 1624580081      5744.
    ##  3 1644430081      7283.
    ##  4 1844505072      2580.
    ##  5 1927972279       916.
    ##  6 2022484408     11371.
    ##  7 2026352035      5567.
    ##  8 2320127002      4717.
    ##  9 2347167796      9520.
    ## 10 2873212765      7556.
    ## # … with 23 more rows

    Calculating the Average calories per participant from the new joined data frame

    average_calories <- activity_and_steps %>%
      group_by(Id) %>%
      summarize(mean_calories = mean(Calories))
    print(average_calories)
    ## # A tibble: 33 × 2
    ##            Id mean_calories
    ##         <dbl>         <dbl>
    ##  1 1503960366         1816.
    ##  2 1624580081         1483.
    ##  3 1644430081         2811.
    ##  4 1844505072         1573.
    ##  5 1927972279         2173.
    ##  6 2022484408         2510.
    ##  7 2026352035         1541.
    ##  8 2320127002         1724.
    ##  9 2347167796         2043.
    ## 10 2873212765         1917.
    ## # … with 23 more rows

    Calculating how many calories were burned by each participant correlating it their average of total steps

    average_cal_steps <- activity_and_steps %>%
      group_by(Id) %>%
      summarize(mean_steps = mean(TotalSteps),
                mean_calories = mean(Calories))
    print(average_cal_steps)
    ## # A tibble: 33 × 3
    ##            Id mean_steps mean_calories
    ##         <dbl>      <dbl>         <dbl>
    ##  1 1503960366     12117.         1816.
    ##  2 1624580081      5744.         1483.
    ##  3 1644430081      7283.         2811.
    ##  4 1844505072      2580.         1573.
    ##  5 1927972279       916.         2173.
    ##  6 2022484408     11371.         2510.
    ##  7 2026352035      5567.         1541.
    ##  8 2320127002      4717.         1724.
    ##  9 2347167796      9520.         2043.
    ## 10 2873212765      7556.         1917.
    ## # … with 23 more rows
    install.packages("ggplot2")
    ## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
    ## (as 'lib' is unspecified)
    library(ggplot2)
    
    ggplot(average_cal_steps, aes(x = mean_steps, y = mean_calories)) +
      geom_point() +
      geom_smooth() +
      xlab("Mean Steps") + ylab("Mean Calories Burned") +
      ggtitle("Mean Steps vs. Mean Calories Burned")
    ## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

    Joining daily_activity and daily_sleep

    I am merging these two data frames together to analyze how these two are correlated to each other. Through this join I will be answering different questions.

    df_merged <- merge(daily_sleep, daily_activity, by = "Id")

    Graphing the correlation between the TotalMinutesAsleep and VeryActiveDistanc

    install.packages("ggplot2")
    ## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
    ## (as 'lib' is unspecified)
    library(ggplot2)
    ggplot(df_merged, aes(x = TotalMinutesAsleep, y = VeryActiveDistance)) +
      geom_point()+
      ggtitle("Relationship Between Total Minutes Asleep and Very Active Distance")

    Graphing the correlation between VeryActiveMinutes and VeryActiveDistance

    ggplot(data = daily_activity) +
        geom_bar(mapping = aes(x = VeryActiveMinutes, y = VeryActiveDistance),
                 stat = "identity", fill = "#ae7855") +
        labs(title = "Very Active Distance compared to VeryActiveMinutes",
             subtitle = "Does more a more active state = a farther distance",
             x = "minutes",
             y = "distance") +
        theme_bw() +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

    Act and Share

    Based on the analysis of Fitbit data, it was found that those who wore their tracker during very active moments tended to have lower average sleep times. This goes against the popular belief that increased physical activity leads to increased tiredness. In fact, participants who were less active actually slept more. Additionally, it was found that those who took more steps in a day burned more calories compared to those who took fewer steps. Participants who traveled greater distances while being active also tended to spend less time sleeping. These insights could potentially be applied to the Bellabeat Leaf. By understanding the relationship between physical activity and sleep, the Leaf could potentially help users to optimize their daily activity levels and improve their overall well-being.

    Some potential insights that Bellabeat could consider to apply to the leaf are:

  • Promote the idea that being more active during the day can help people sleep better at night, rather than just assuming that increased activity will lead to increased fatigue.
  • Consider developing features or strategies that help people track and optimize their activity levels in order to improve sleep quality.
  • Consider offering tips or resources for people who are struggling to get enough sleep, as this may be related to their activity levels or other factors